Week 13a: Data Scraping

Data Scraping

Data scraping is the use of a computer program to extract information, typically from human-readable websites. We could spend multiple weeks on this topic, so this is a basic introduction that will allow you to:

  • extract text and numbers from webpages and
  • extract tables from webpages.

A bit about HTML

HTML elements are written with a start tag, an end tag, and the content in between: <tagname>content</tagname>. These tags typically contain the textual content we wish to scrape. Some common tags include:

  • <h1>, <h2>, …: Headings
  • <p>: Paragraph element
  • <ul>: Unordered (bulleted) list
  • <ol>: Ordered list
  • <li>: Individual list item
  • <div>: Division or section
  • <table>: Table

HTML Example

My Website
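The markup for this example did not survive on the slide beyond its heading text. As a stand-in, the sketch below builds a small hypothetical page (only the "My Website" heading comes from the slide; the paragraph and list are invented for illustration) and parses it with read_html() so the tag structure can be inspected from R.

```r
library(rvest)

# Hypothetical page: only the "My Website" heading comes from the slide.
example_html <- "
<html>
  <head><title>My Website</title></head>
  <body>
    <h1>My Website</h1>
    <p>A short paragraph of text.</p>
    <ul>
      <li>First item</li>
      <li>Second item</li>
    </ul>
  </body>
</html>"

# read_html() accepts a literal HTML string as well as a URL or file path.
page <- read_html(example_html)

# List the tag names of the elements directly inside <body>.
tags <- page %>% html_node("body") %>% html_children() %>% html_name()
print(tags)

# Count the list items nested inside the <ul> element.
n_items <- length(page %>% html_nodes("li"))
print(n_items)
```

Each start tag is matched by an end tag, and nesting (the <li> elements inside <ul>) is what the CSS selectors used for scraping rely on.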

Scraping with rvest

## {xml_nodeset (1)}
## [1] <h1>Andrew Hoegh </h1>
## [1] "Andrew Hoegh "
## [1] "Andrew Hoegh"               "\n                        "
## [3] "Education"                  "Contact Information"       
## [5] "Research Interests"         "Teaching"
## [1] " Site Menu expand"                                                
## [2] "Statistics Faculty"                                               
## [3] "Assistant Professor of Statistics"                                
## [4] "Andrew HoeghMontana State UniversityBozeman, MT 59717"            
## [5] "Office: Wilson 2-241Tel: (406) 994-5340andrew.hoegh @ montana.edu"
## [6] "Located in Bozeman, MT"                                           
## [7] "For questions or comments contact the Ask Us Desk."
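The code that produced the output above is not preserved in these slides. A minimal sketch of the usual two-step rvest pattern is given here, run on a small hypothetical stand-in page rather than the real faculty page (the markup is an assumption). html_nodes() selects elements with a CSS selector and html_text() extracts their text; newer rvest releases also offer html_elements() for the same job.

```r
library(rvest)

# Hypothetical stand-in for the faculty page; not the real markup.
page <- read_html("
<html><body>
  <h1>Andrew Hoegh </h1>
  <h2>Education</h2>
  <h2>Teaching</h2>
  <p>Assistant Professor of Statistics</p>
  <p>Located in Bozeman, MT</p>
</body></html>")

# Step 1: select nodes with a CSS selector; the result is an xml_nodeset.
h1_nodes <- page %>% html_nodes("h1")
print(h1_nodes)

# Step 2: strip the tags and keep only the text (whitespace is kept as-is).
h1_text <- h1_nodes %>% html_text()
print(h1_text)

# The same two steps for subheadings and paragraphs.
h2_text <- page %>% html_nodes("h2") %>% html_text()
p_text  <- page %>% html_nodes("p") %>% html_text()
print(h2_text)
print(p_text)
```

With a live site, the string passed to read_html() would be replaced by the page URL.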

Tidying Up

## [1] "Andrew Hoegh"             "                        "
## [3] "Education"                "Contact Information"     
## [5] "Research Interests"       "Teaching"
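The tidying code itself is not shown on the slide. One possible approach, sketched here with base R only, removes newline characters and then trims surrounding whitespace. The input vector is a shortened copy of the scraped headings, and the final filtering step goes slightly further than the output above by dropping entries that end up empty.

```r
# Shortened copy of the scraped heading text (whitespace abbreviated).
headings <- c("Andrew Hoegh ", "\n          ", "Education", "Teaching")

# Remove newline characters, as in the slide output.
no_newlines <- gsub("\n", "", headings, fixed = TRUE)

# Trim leading and trailing whitespace from every entry.
trimmed <- trimws(no_newlines)

# Optional extra step: drop entries that are empty after trimming.
cleaned <- trimmed[trimmed != ""]
print(cleaned)
```

The stringr functions str_replace_all() and str_trim() would do the same work if the tidyverse is already loaded.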

Scraping li to find email address

##  [1] ""                                                       
##  [2] ""                                                       
##  [3] ""                                                       
##  [4] "Search"                                                 
##  [5] "Skip Navigation"                                        
##  [6] "Andrew Hoegh"                                           
##  [7] "Teaching"                                               
##  [8] "Research Interests"                                     
##  [9] "CV"                                                     
## [10] "Department of Mathematical Sciences"                    
## [11] "Andrew Hoegh"                                           
## [12] "Ph.D. (2016) Virginia Tech, Blacksburg, VA"             
## [13] "M.S. (2008) Colorado School of Mines, Golden, CO"       
## [14] "B.A. (2006) Luther College, Decorah, IA"                
## [15] "Phone: (406) - 994-5340"                                
## [16] "Email: andrew.hoegh @ montana.edu"                      
## [17] "Bayesian statistics"                                    
## [18] "Statistical Ecology"                                    
## [19] "Spatiotemporal Modeling"                                
## [20] "Computational Statistics"                               
## [21] "Sports Analytics"                                       
## [22] "Applied environmental and ecological research"          
## [23] "STAT 532 - Bayesian Statistics"                         
## [24] "STAT 491 - Intro to Bayesian Stats"                     
## [25] "STAT 446 - Sampling"                                    
## [26] "STAT 436/536 - Time Series"                             
## [27] "STAT 408 - Statistical Computing and Graphical Analysis"
## [28] "More Information"                                       
## [29] "Admissions"                                             
## [30] "Current Students"                                       
## [31] "Faculty & Staff"                                        
## [32] "Parents & Family"                                       
## [33] "Alumni"                                                 
## [34] "Resources"                                              
## [35] "Accessibility"                                          
## [36] "Contact List"                                           
## [37] "Directories"                                            
## [38] "Jobs"                                                   
## [39] "Legal & Privacy Policy"                                 
## [40] "Site Index"                                             
## [41] "Follow Us"                                              
## [42] "Facebook Twitter YouTube Instagram LinkedIn"

Scraping li to find email address

## [1] "Email: andrew.hoegh @ montana.edu"
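One way to go from the full list of <li> text to the single email line is to filter the character vector with a pattern match. The sketch below assumes a small hypothetical list containing three of the items printed above; the original filtering code is not preserved, so grep() is a stand-in for whatever the slides used.

```r
library(rvest)

# Hypothetical fragment with three of the list items shown above.
page <- read_html("
<ul>
  <li>Phone: (406) - 994-5340</li>
  <li>Email: andrew.hoegh @ montana.edu</li>
  <li>Bayesian statistics</li>
</ul>")

# Extract the text of every list item.
li_text <- page %>% html_nodes("li") %>% html_text()

# Keep only the entries that contain the literal string "Email".
email_line <- grep("Email", li_text, value = TRUE, fixed = TRUE)
print(email_line)

# Optionally strip the label and the spaces around "@".
email <- gsub(" ", "", sub("Email: ", "", email_line, fixed = TRUE), fixed = TRUE)
print(email)
```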

A River Runs Through It

IMDB: A River Runs Through It

Get Movie Title

The movie title is A River Runs Through It (1992) - IMDb.

Get Storyline

The storyline is: The Maclean brothers, Paul and Norman, live a relatively idyllic life in rural Montana, spending much of their time fly fishing. The sons of a minister, the boys eventually part company when Norman moves east to attend college, leaving his rebellious brother to find trouble back home. When Norman finally returns, the siblings resume their fishing outings, and assess both where they’ve been and where they’re going. Written by Jwelch5742.
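The title and storyline sentences above can be produced by pulling one node each from the page and pasting the text into a sentence. The sketch below runs on a hypothetical stand-in document: the <title> text matches the slide, but the storyline class name and the surrounding markup are assumptions, since the real IMDb page structure is not shown here and changes over time.

```r
library(rvest)

# Hypothetical stand-in for the movie page; the "storyline" class is assumed.
movie <- read_html("
<html>
  <head><title>A River Runs Through It (1992) - IMDb</title></head>
  <body>
    <div class='storyline'>
      <p>
        The Maclean brothers, Paul and Norman, live a relatively idyllic life in rural Montana.
      </p>
    </div>
  </body>
</html>")

# html_node() returns the first match, which suits one-off items like <title>.
title <- movie %>% html_node("title") %>% html_text()

# A class selector narrows the search to the storyline paragraph.
storyline <- movie %>% html_node("div.storyline p") %>% html_text() %>% trimws()

# Combine the scraped text with fixed wording, as on the slides.
title_sentence <- paste0("The movie title is ", title, ".")
print(title_sentence)
print(storyline)
```

On a real page, the selector for the storyline has to be found by inspecting the page source or with a selector tool.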

Scraping Tables

http://www.montana.edu/marketing/about-msu/

2019 / 2020            Resident   Nonresident
Tuition/Fees             $7,320       $25,850
Room/Board              $10,300       $10,300
Books/Supplies           $1,450        $1,450
Total Estimated Cost    $19,070       $37,600
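Tables can be pulled into a data frame with html_table(). The sketch below rebuilds the cost table above as a hypothetical HTML fragment (the real page markup is an assumption), parses it, and converts the dollar strings to numbers so the totals can be checked.

```r
library(rvest)

# Hypothetical markup holding the same values as the table above.
page <- read_html("
<table>
  <tr><th>2019 / 2020</th><th>Resident</th><th>Nonresident</th></tr>
  <tr><td>Tuition/Fees</td><td>$7,320</td><td>$25,850</td></tr>
  <tr><td>Room/Board</td><td>$10,300</td><td>$10,300</td></tr>
  <tr><td>Books/Supplies</td><td>$1,450</td><td>$1,450</td></tr>
  <tr><td>Total Estimated Cost</td><td>$19,070</td><td>$37,600</td></tr>
</table>")

# html_table() on a single <table> node returns one data frame;
# the <th> row becomes the column names.
costs <- page %>% html_node("table") %>% html_table()

# Dollar amounts arrive as text, so strip "$" and "," before converting.
to_number <- function(x) as.numeric(gsub("[$,]", "", x))
costs$Resident    <- to_number(costs$Resident)
costs$Nonresident <- to_number(costs$Nonresident)

print(costs)

# Check that the first three rows add up to the stated totals.
resident_total    <- sum(costs$Resident[1:3])
nonresident_total <- sum(costs$Nonresident[1:3])
```

When a page holds several tables, html_nodes("table") followed by html_table() returns a list of data frames, and the one of interest is picked by position.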